 Baseball



Long-form factuality in large language models

Neural Information Processing Systems

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process that sends search queries to Google Search and determines whether each fact is supported by the search results. Furthermore, we propose extending the F1 score as an aggregated metric for long-form factuality.
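
SAFE as summarized here is procedural enough to sketch. The outline below is a minimal, hypothetical rendering of that decompose-search-rate-aggregate flow, assuming placeholder functions llm_complete and google_search that stand in for an LLM API and a search backend (neither comes from the paper's released code), plus an F1@K-style aggregation of supported facts.

```python
# Placeholders for an LLM API and a search API; they only mark where real calls would go.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire this up to an actual LLM API")

def google_search(query: str) -> list[str]:
    raise NotImplementedError("wire this up to an actual search API")

def split_into_facts(response: str) -> list[str]:
    """Ask the LLM to decompose a long-form response into individual facts."""
    out = llm_complete(
        "List each individual fact stated in the following text, one per line:\n" + response
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

def rate_fact(fact: str, max_steps: int = 3) -> bool:
    """Multi-step check: issue search queries, then judge support from the results."""
    evidence: list[str] = []
    for _ in range(max_steps):
        query = llm_complete(f"Write a Google Search query that would help verify: {fact}")
        evidence.extend(google_search(query))
    verdict = llm_complete(
        "Given the search results below, answer SUPPORTED or NOT_SUPPORTED for the fact.\n"
        f"Fact: {fact}\nResults:\n" + "\n".join(evidence)
    )
    return "NOT_SUPPORTED" not in verdict.upper()

def f1_at_k(num_supported: int, num_facts: int, k: int) -> float:
    """F1-style aggregate: precision over stated facts, recall against a target count K."""
    if num_facts == 0:
        return 0.0
    precision = num_supported / num_facts
    recall = min(num_supported / k, 1.0)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```

The recall term is measured against a target number of facts K, so a response is rewarded for stating many supported facts only up to that budget.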


OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Neural Information Processing Systems

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction.
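
As a rough picture of the dataflow this abstract describes, the sketch below wires a stand-in pixel-level perception module to a toy language model: visual tokens and embedded text are concatenated, and the model emits both text logits and a query that a mask decoder could consume. Every class name, dimension, and wiring choice here is an invented placeholder, not OMG-LLaVA's actual architecture.

```python
import torch
import torch.nn as nn

class PixelPerceptionStub(nn.Module):
    """Stands in for a universal segmentation backbone that produces visual tokens."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.patch = nn.Conv2d(3, dim, kernel_size=32, stride=32)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, num_tokens, dim)
        return self.patch(image).flatten(2).transpose(1, 2)

class ToyLLM(nn.Module):
    """Stands in for the language model that consumes text plus visual tokens."""
    def __init__(self, dim: int = 256, vocab: int = 1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_text = nn.Linear(dim, vocab)        # toy vocabulary head
        self.to_mask_query = nn.Linear(dim, dim)    # query a mask decoder could consume

    def forward(self, text_emb: torch.Tensor, visual_tokens: torch.Tensor):
        h = self.blocks(torch.cat([visual_tokens, text_emb], dim=1))
        return self.to_text(h[:, -1]), self.to_mask_query(h[:, -1])

image = torch.randn(1, 3, 256, 256)       # dummy image
text_emb = torch.randn(1, 8, 256)         # dummy embedded instruction tokens
vis_tokens = PixelPerceptionStub()(image)
text_logits, mask_query = ToyLLM()(text_emb, vis_tokens)
# In a real system, mask_query would be decoded into a segmentation mask;
# here it only illustrates that the language model's output can drive pixel-level prediction.
```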


A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval

Neural Information Processing Systems

Cross-modal retrieval aims to build correspondence between multiple modalities by learning a common representation space. Typically, an image can match multiple texts semantically and vice versa, which significantly increases the difficulty of this task. To address this problem, probabilistic embedding is proposed to quantify these many-to-many relationships. However, existing datasets (e.g., MS-COCO) and metrics (e.g., Recall@K) cannot fully represent these diverse correspondences due to non-exhaustive annotations. Based on this observation, we utilize the semantic correlation computed by CIDEr to find potential correspondences. Then we present an effective metric, named Average Semantic Precision (ASP), which measures the ranking precision of semantic correlation for retrieval sets. Additionally, we introduce a novel and concise objective, coined Differentiable ASP Approximation (DAA).
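
The abstract does not spell out ASP's formula, but the idea of scoring how well a retrieval ranking respects CIDEr-derived semantic relevance can be illustrated with an ordinary average-precision computation. Everything below (the relevance threshold, the relevance rule, the function name) is an assumption made for illustration, not the paper's definition of ASP or DAA.

```python
import numpy as np

def average_semantic_precision(sim_scores, cider_scores, cider_thresh=0.7):
    """Illustrative AP-style metric: gallery items whose CIDEr-based semantic
    correlation with the query exceeds cider_thresh are treated as relevant,
    and we measure how highly the retrieval similarity ranks them.
    The threshold and relevance rule are assumptions, not the paper's ASP."""
    order = np.argsort(-np.asarray(sim_scores))               # retrieval ranking
    relevant = (np.asarray(cider_scores) >= cider_thresh)[order]
    if relevant.sum() == 0:
        return 0.0
    hits = np.cumsum(relevant)                                # relevant items found so far
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_hit = hits[relevant] / ranks[relevant]       # precision at each relevant rank
    return float(precision_at_hit.mean())

# Toy usage: five gallery texts for one image query.
sim = [0.9, 0.2, 0.7, 0.4, 0.6]       # model's retrieval similarities
cider = [1.2, 0.1, 0.9, 0.8, 0.3]     # CIDEr-derived semantic correlation
print(average_semantic_precision(sim, cider))  # ~0.917
```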


Accelerating Transformers with Spectrum-Preserving Token Merging

Neural Information Processing Systems

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVA), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top k most similar tokens. However, these methods have significant drawbacks, such as sensitivity to token-splitting strategies and damage to informative tokens in later layers.
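
Bipartite Soft Matching, the baseline these methods build on, is simple to sketch: split tokens into two alternating sets, score cross-set cosine similarity, and average-merge the r highest-scoring pairs. The sketch below follows that commonly described procedure under simplifying assumptions (no attention-weighted merging, no token-size tracking) and is not the spectrum-preserving method this paper proposes.

```python
import torch

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r token pairs in a (B, N, C) tensor by simplified bipartite soft matching.

    Alternate tokens into sets A and B, find each A token's most similar B token by
    cosine similarity, keep the r highest-scoring pairs, and merge each pair by
    averaging. Unmerged tokens pass through unchanged.
    """
    B, N, C = x.shape
    a, b = x[:, ::2], x[:, 1::2]                      # alternating split into sets A and B
    an = a / a.norm(dim=-1, keepdim=True)
    bn = b / b.norm(dim=-1, keepdim=True)
    scores = an @ bn.transpose(-1, -2)                # (B, Na, Nb) cosine similarities
    best_val, best_idx = scores.max(dim=-1)           # best B partner per A token
    merge_order = best_val.argsort(dim=-1, descending=True)
    merged_batches = []
    for i in range(B):
        src = merge_order[i, :r]                      # A tokens to merge away
        dst = best_idx[i, src]                        # their B partners
        b_i = b[i].clone()
        # If several A tokens pick the same B token, later writes win here;
        # a scatter-based mean would be used in a more careful implementation.
        b_i[dst] = (b_i[dst] + a[i, src]) / 2
        keep = torch.ones(a.shape[1], dtype=torch.bool)
        keep[src] = False
        merged_batches.append(torch.cat([a[i][keep], b_i], dim=0))
    return torch.stack(merged_batches)                # (B, N - r, C)

tokens = torch.randn(2, 196, 64)                      # e.g. ViT patch tokens
print(bipartite_soft_matching(tokens, r=16).shape)    # torch.Size([2, 180, 64])
```

The sensitivity the abstract mentions is visible here: which tokens land in set A versus set B is fixed purely by position, so the split strategy directly shapes which merges are even possible.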


Task / Input Question / CoT in Unbiased Context (table excerpt)

Neural Information Processing Systems

[Partially recoverable table excerpt: each row pairs a reasoning task with the same question answered by chain-of-thought once in an unbiased context and once in a biased context, together with the observed failure mode. Recoverable rows include: Navigate (Claude 1.0, few-shot), where both traces reach the origin (0, 0) yet conclude with different answers; Sports Understanding (Claude 1.0, few-shot), where "Kenta Maeda threw to first base in the American League Championship Series" is judged plausible in one context and implausible in the other; Hyperbaton (Claude 1.0, few-shot), where adjective-order reasoning reaches conflicting conclusions about options such as "terrible big pink Indonesian hiking shoe"; Web of Lies (zero-shot), where the model reasons correctly through the chain of truth-tellers and liars but asserts the wrong answer; and Disambiguation QA, which asks for the antecedent of a pronoun in a sentence mentioning "the laborer" and "the secretary".]
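
These rows come from asking the same question with and without a biasing hint (for example, "I think the answer is (A) but I'm curious to hear what you think"). A minimal, hypothetical harness for that comparison is sketched below; llm_complete is a placeholder for whatever chat-model API is being probed, and the exact bias phrasing is only an example.

```python
from typing import Optional

# Minimal sketch of a biased-versus-unbiased chain-of-thought probe.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("wire this up to an actual chat-model API")

def probe(question: str, choices: list[str], biased_choice: Optional[str] = None) -> str:
    """Ask for step-by-step reasoning, optionally prepending a biasing hint."""
    hint = ""
    if biased_choice is not None:
        # Example bias phrasing only; the exact wording is an assumption.
        hint = f"I think the answer is {biased_choice} but I'm curious to hear what you think.\n"
    prompt = (
        hint
        + question
        + "\nAnswer choices: " + " ".join(choices)
        + "\nPlease reason step by step, then give your final answer as a single letter."
    )
    return llm_complete(prompt)

# Usage idea: run each question twice and flag answers that flip between contexts.
# unbiased = probe(q, ["(A) Yes", "(B) No"])
# biased   = probe(q, ["(A) Yes", "(B) No"], biased_choice="(A)")
```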



Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Neural Information Processing Systems

Large language models (LLMs) have recently been leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
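
To make "diversely attributed prompts" concrete, the sketch below contrasts a simple class-conditional prompt with one that samples attribute values such as length, style, and subtopic. The attribute names and values are illustrative assumptions rather than the paper's configuration, and llm_complete is a placeholder for a real LLM API.

```python
import random

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM API call")

def simple_prompt(label: str) -> str:
    """Baseline class-conditional prompt: only the class label varies."""
    return f"Write a news article about {label}."

# Illustrative attribute pool; a real attribute set would be derived per dataset.
ATTRIBUTES = {
    "length": ["around 50 words", "around 150 words"],
    "style": ["formal reporting", "casual blog-like tone"],
    "subtopic": ["player trades", "game highlights", "injury updates"],
}

def attributed_prompt(label: str) -> str:
    """Attributed prompt: sample attribute values to diversify the generated data."""
    picks = {name: random.choice(values) for name, values in ATTRIBUTES.items()}
    return (
        f"Write a news article about {label}. "
        f"Length: {picks['length']}. Style: {picks['style']}. "
        f"Focus on the subtopic: {picks['subtopic']}."
    )

# synthetic_example = llm_complete(attributed_prompt("baseball"))
```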


Jungo Kasai

Neural Information Processing Systems

Q: How many home runs has Shohei Ohtani hit? Questions like this are used to benchmark question answering at the present time: answers (e.g., the number of Shohei Ohtani's home runs) change in real time.

Why was the dataset created? The dataset was created to provide a dynamic platform that asks questions about the current world, challenging QA systems to provide up-to-date answers. It may also identify areas of potential research, such as improving how QA systems deal with unanswerable questions.

What are the instances?


PitcherNet helps researchers throw strikes with AI analysis

AIHub

University of Waterloo researchers have developed new artificial intelligence (AI) technology that can accurately analyze pitcher performance and mechanics using low-resolution video of baseball games. The system, developed for the Baltimore Orioles by the Waterloo team, plugs holes in the much more elaborate and expensive technology already installed in most stadiums that host Major League Baseball (MLB), whose teams have increasingly tapped into data analytics in recent years. Those stadium systems, produced by a company called Hawk-Eye Innovations, use multiple special cameras in each park to catch players in action, but the data they yield is typically available only to the home team that owns the stadium where the games are played. To add away games to their analytics operation, as well as to use smartphone video taken by scouts at minor league and college games, the Orioles asked video and AI experts at Waterloo for help about three years ago.

[Image caption: Waterloo researchers convert video of a pitcher's performance into a two-dimensional model that PitcherNet's AI algorithm can later analyze.]